Skip to content

fix(addie): backfill owner test history + stop dual-write for owner runs#4264

Open
EmmaLouise2018 wants to merge 5 commits intoEmmaLouise2018/unification-pr2-dashboard-triggered-byfrom
EmmaLouise2018/unification-pr3-backfill-and-drop-test-history
Open

fix(addie): backfill owner test history + stop dual-write for owner runs#4264
EmmaLouise2018 wants to merge 5 commits intoEmmaLouise2018/unification-pr2-dashboard-triggered-byfrom
EmmaLouise2018/unification-pr3-backfill-and-drop-test-history

Conversation

@EmmaLouise2018
Copy link
Copy Markdown
Contributor

PR 3 of the #4247 unification stack. Stacked on #4263#4250.

Summary

Two coupled changes that close the dual-write bug #4247 is targeting:

Backfill historical owner-triggered tests into canonical state. Migration 472_backfill_agent_test_history_to_compliance_runs.sql copies every `agent_test_history` row with `user_id IS NOT NULL` into `agent_compliance_runs` as `triggered_by = 'owner_test'`, carrying the source row id in `observations_json.backfill_source_id` for idempotent re-runs (`WHERE NOT EXISTS` guard). Each agent's most-recent backfilled row also upserts into `agent_compliance_status` — last-write-wins on `(agent_url)` per the contract pinned in #4250's race test — so the dashboard's compliance tile immediately reflects a real verdict for any agent that was tested through Addie pre-PR-#4250 and never ran the heartbeat.

Stop the dual write for owner runs. `evaluate_agent_quality` no longer calls `agentContextDb.recordTest()` when the caller owns the agent — that's exactly the dual-write that #4247 is closing. The legacy `recordTest` call is retained ONLY for third-party runs so a stranger who tests someone else's agent still has a session-scoped audit trail in their own `agent_test_history`. Owner-triggered runs persist exclusively to canonical state going forward.

Why

PR #4250 made owner runs write to the canonical tables but kept the legacy `recordTest` write for backward compatibility. Two writers per owner run → drift surface. This PR removes the legacy write for owner runs and backfills the rows that were already written through the legacy path so we don't lose history.

Brian's bar: don't lose data silently. Migration 472's idempotent guard + `backfill_source_id` provenance means a re-run is a no-op and any future audit can trace back to the original `agent_test_history` row.

Out of scope (deferred)

Stacked on

Merge order: #4250#4263 → this PR.

Test plan

  • `tsc --noEmit -p server/tsconfig.json` clean
  • Migration is idempotent — re-running on a database with a partial backfill is a no-op via the `WHERE NOT EXISTS` guard on `backfill_source_id`
  • Staging dry-run: count owner-triggered `agent_test_history` rows pre-migration vs `agent_compliance_runs` rows added — should match exactly
  • Smoke after deploy: run `evaluate_agent_quality` as owner, confirm `agent_compliance_runs` gets a `triggered_by='owner_test'` row AND `agent_test_history` does NOT get a new row
  • Smoke after deploy: third-party runs still write to `agent_test_history` (session audit trail preserved)
  • `/dashboard/agents` history panel shows historical "Your test" badges from the backfill (proves the migration landed and the rendered `triggered_by` is correct)

claude and others added 5 commits May 8, 2026 19:00
…ce state

Closes the 12-hour gap between owner-triggered storyboard runs and the public
/api/registry/agents/:url/compliance endpoint (issue #4247, PR 1 of 4).

When evaluate_agent_quality is triggered by the agent owner, the result is now
written to agent_compliance_status + agent_compliance_runs + agent_storyboard_status
with triggered_by = 'owner_test'. Non-owner runs continue writing to agent_test_history
(deprecated in PR 3). Migration 471 adds 'owner_test' to both triggered_by CHECK
constraints. notifyComplianceChange is intentionally suppressed for owner runs to
prevent iteration-loop Slack spam.

https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb
Build-generated output produced by npm run build; matches the tracking
pattern of member-agents-openapi.js and registry.js already in dist/schemas/.

https://claude.ai/code/session_01UNHkGhBXk9XD2dpzvSLdhb
…ns test

Address review feedback from @EmmaLouise2018 on PR #4250:

1. `verdict_source` field on /api/registry/agents/:url/compliance
   — `AgentComplianceDetailSchema` gains optional `verdict_source`:
     'heartbeat' | 'owner_test' | 'manual' | 'webhook' | null
   — `getComplianceStatus` and `bulkGetComplianceStatus` join
     `agent_compliance_runs` via LATERAL subquery (dry_run=false,
     ORDER BY tested_at DESC LIMIT 1) to surface the triggered_by
     of the most recent run.  No migration needed.
   — Endpoint response emits `verdict_source: status.last_triggered_by`.
   — `AgentComplianceStatus` interface gets `last_triggered_by` field.

2. Last-write-wins contract test
   — New `compliance-db-last-write-wins.test.ts` pins the ON CONFLICT
     DO UPDATE semantics: every recordComplianceRun call overwrites
     agent_compliance_status regardless of triggered_by source.  A
     future change to first-write-wins or priority ordering would
     break these tests.

https://claude.ai/code/session_01NVVqgeSGevUGXgDbMw1XKZ
PR 2 of the #4247 unification stack. Reads two fields PR #4250 added
to the compliance API but the dashboard wasn't yet rendering:

- compliance tile: appends "(your test)" / "(heartbeat)" / "(manual)"
  / "(webhook)" after Last checked, so operators see whether the
  current verdict came from their own evaluate_agent_quality run or
  the scheduled heartbeat.
- history panel: per-run badge with the same source label, info-blue
  for owner_test and neutral for the rest. Pre-PR-1 rows render with
  neutral — no regression.

No backend changes; pure UI surfacing of fields already in the API.
Stacked on PR #4250.
PR 3 of the #4247 unification stack.

Migration 472 backfills every agent_test_history row with a user_id
into agent_compliance_runs as triggered_by='owner_test', carrying the
source id in observations_json.backfill_source_id for idempotent re-runs.
Each agent's latest backfilled row upserts into agent_compliance_status
so the dashboard immediately reflects a real verdict for agents tested
through Addie pre-PR-#4250.

evaluate_agent_quality stops calling recordTest() when the caller owns
the agent — that was the dual-write that #4247 is closing. recordTest
is retained ONLY for third-party runs so strangers testing someone
else's agent still have a session-scoped audit trail.

Drop of agent_test_history table is deferred behind the 14-day soak
from #4250 + 7-day soak from #4263 + S3 cold-storage export of
non-owner rows. Migration 472 documents this in its trailing comment.

Stacked on #4263#4250.
@bokelley
Copy link
Copy Markdown
Contributor

bokelley commented May 9, 2026

Code review (expert pass): block — three issues that need addressing before this lands.

1. Backfill is not chunked or transactional safety-netted.
server/src/db/migrations/472_backfill_agent_test_history_to_compliance_runs.sql:31-58 is a single bulk INSERT … SELECT. On any prod-sized agent_test_history, this holds locks the whole time and is a single point of failure. Per repo convention (see feedback_prod_runnable_scripts_path.md precedent), prod backfill belongs in server/src/scripts/ with chunked windows + dry-run, not a release-command migration. At minimum, chunk by started_at window inside a DO $$ block.

2. Status-priority contract drift between PR and #4250's test.
Migration 472:80-122 implements first-newer-wins via last_checked_at < EXCLUDED.last_checked_at guards on the upsert. PR #4250's compliance-db-last-write-wins.test.ts asserts plain last-write-wins. The PR body says "heartbeat always wins on freshness" — pick one. The test in #4250 will pass at #4250 time and start failing the contract this migration enforces. Reconcile before merge.

3. Partial dual-write window is unguarded.
Between #4250 deploy (canonical writes start) and this PR's deploy (backfill runs), owner runs write to BOTH canonical and agent_test_history. The migration's WHERE NOT EXISTS … backfill_source_id guard correctly skips already-backfilled rows — but owner runs in that window have no backfill_source_id linkage, so they'll be inserted twice into agent_compliance_runs (once by #4250's live write path, once by the backfill).

Add a guard: AND ath.started_at < (SELECT MIN(tested_at) FROM agent_compliance_runs WHERE triggered_by='owner_test') or equivalent — only backfill rows from before the canonical-write cutover.

#4263 is clean and #4268 is solid; this PR is the chain blocker.

@EmmaLouise2018 EmmaLouise2018 force-pushed the EmmaLouise2018/unification-pr2-dashboard-triggered-by branch from 47b26d4 to fd7d406 Compare May 9, 2026 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants